library(tidyverse)
library(here)
library(tidytext)
library(pdftools)
library(ggwordcloud)
library(textdata)
Olivia Hemond
February 17, 2024
This analysis uses the text from the book “Harry Potter and the Prisoner of Azkaban” by J.K. Rowling. The story follows Harry through his third year at Hogwarts, as he learns to fight Dementors, sneak around using the Marauder’s Map, and even travel in time! By the end of the story, Harry learns a lot about the mysterious escaped criminal Sirius Black and about his own parents’ past.
Source: Rowling, J. K. Harry Potter and the Prisoner of Azkaban. New York: Arthur A. Levine Books, 1999. Full text here.
This analysis had two main goals:
Find and visualize the most common words within each chapter and in the book as a whole
Perform a sentiment analysis to visualize the tone (positive or negative) of each chapter of the book
Import & Tidy Text
Prepare the data
Read in the complete PDF of Harry Potter and the Prisoner of Azkaban
Add column for page number
Create character strings for each line of text in the whole book
Let each line of text be its own row in the dataframe
Remove extra whitespaces
Add column for chapter number (in numeric format)
Extract every individual word
Remove stop words
Stop words are commonly used words that don’t carry significant meaning (e.g., “of”, “a”, “the”, “and”)
Remove stop words from this dataset
Find Most-Used Words
Calculate the five most-used words in each chapter and visualize them
Graph the frequency of different character names being mentioned throughout the book
Find the top 100 most-used words in the entire book and visualize them as a word cloud
Perform Sentiment Analysis
Use the “afinn” lexicon to assign each word a value on the positive/negative scale
For each chapter, take a weighted average of the positivity/negativity scores (weighted by the number of times each word was used)
Visualize how the tone of the book changes in each chapter
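The wrangling code that follows starts from `hp3_text`, which this post never shows being created. A minimal sketch of the read-in step is given here, assuming the PDF sits at a hypothetical `data/hp3.pdf` inside the project. The path, the `hp3_path` variable, and the `read_book()` helper are illustrative names, not taken from the original analysis. `pdf_text()` from pdftools returns a character vector with one element per page, which is the shape the later `str_split()` and `unnest()` steps expect.

```r
library(pdftools)
library(here)

# Hypothetical location and file name; adjust to wherever the PDF is stored
hp3_path <- here("data", "hp3.pdf")

# Hypothetical helper: read every page of a PDF, failing clearly if it is missing
read_book <- function(path) {
  if (!file.exists(path)) {
    stop("No PDF found at: ", path)
  }
  # One character string per page, with lines separated by "\n"
  pdf_text(path)
}
```

Under these assumptions, `hp3_text <- read_book(hp3_path)` would supply the input for the wrangling code.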
### Wrangle and tidy
hp3_lines <- data.frame(hp3_text) %>%
  mutate(page = 1:n() - 1) %>%
  mutate(text_full = str_split(hp3_text, pattern = '\\n')) %>% # creates character strings of each line
  unnest(text_full) %>% # make row for each line of text
  select(!hp3_text) %>% # don't need original data column anymore
  mutate(text_full = str_squish(text_full)) # remove any extra whitespace
### Add chapters in separate column
hp3_chapters <- hp3_lines %>%
  slice(-1) %>% # remove empty first row
  mutate(chapter = ifelse(str_detect(text_full, "CHAPTER"), text_full, NA)) %>% # creates new chapter column
  mutate(chapter = str_remove(chapter, "CHAPTER")) %>% # remove the word "chapter"
  fill(chapter, .direction = 'down') %>% # assign chapter to all entries in each chapter
  mutate(chapter_num = case_when(
    chapter == " ONE" ~ 1,
    chapter == " TWO" ~ 2,
    chapter == " THREE" ~ 3,
    chapter == " FOUR" ~ 4,
    chapter == " FIVE" ~ 5,
    chapter == " SIX" ~ 6,
    chapter == " SEVEN" ~ 7,
    chapter == " EIGHT" ~ 8,
    chapter == " NINE" ~ 9,
    chapter == " TEN" ~ 10,
    chapter == " ELEVEN" ~ 11,
    chapter == " TWELVE" ~ 12,
    chapter == " THIRTEEN" ~ 13,
    chapter == " FOURTEEN" ~ 14,
    chapter == " FIFTEEN" ~ 15,
    chapter == " SIXTEEN" ~ 16,
    chapter == " SEVENTEEN" ~ 17,
    chapter == " EIGHTEEN" ~ 18,
    chapter == " NINETEEN" ~ 19,
    chapter == " TWENTY" ~ 20,
    chapter == " TWENTY-ONE" ~ 21,
    chapter == " TWENTY-TWO" ~ 22
  )) # change written chapter numbers into actual numbers
### Top 5 words for each chapter
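The code that follows uses `hp3_wordcount_clean`, but the tokenizing and stop-word steps that would produce it do not appear in this post. The sketch below shows one way those steps could look, run on a made-up three-line data frame so it is self-contained. The `toy_*` names and the sample sentences are invented for illustration. Applying the same pipeline to `hp3_chapters` is an assumption about how the original object was built, not a confirmed reconstruction.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for hp3_chapters: one row per line of text, with a chapter number
toy_chapters <- tibble(
  chapter_num = c(1, 1, 2),
  text_full = c(
    "The owl flew over the house",
    "The owl and the cat",
    "A cat sat on the broom"
  )
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
toy_words <- toy_chapters %>%
  unnest_tokens(word, text_full)

# Drop stop words using tidytext's built-in stop_words table,
# then count how often each remaining word appears in each chapter
toy_wordcount_clean <- toy_words %>%
  anti_join(stop_words, by = "word") %>%
  count(chapter_num, word)
```

The result has one row per chapter and word with a count column `n`, which matches the columns the top-five calculation relies on.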
top_5_words <- hp3_wordcount_clean %>%
  group_by(chapter_num) %>%
  arrange(-n) %>%
  slice(1:5) %>%
  ungroup()
### Plot
ggplot(top_5_words, aes(x = n, y = word)) +
  geom_col(fill = "#740001") +
  facet_wrap(~as.factor(chapter_num), scales = "free") +
  labs(x = "", y = "") +
  theme_minimal()
Many of the most-used words are character names, which makes sense for a book with a lot of dialogue and third-person narration. Using how often a character's name appears as a proxy for that character's relevance in a given chapter, we can track how certain characters enter and leave the narrative:
### Look at some key characters over the course of the book
hp3_character_count <- hp3_wordcount_clean %>%
  filter(word %in% c("hagrid", "lupin", "buckbeak", "snape", "pettigrew", "sirius"))
hp3_words_by_chap <- hp3_wordcount_clean %>%
  group_by(chapter_num) %>%
  summarize(word_count = sum(n))
hp3_characters_by_chap <- left_join(hp3_character_count, hp3_words_by_chap, by = "chapter_num") %>%
  mutate(freq_per_chap = n/word_count)
### Plot
ggplot(hp3_characters_by_chap, aes(x = as.factor(chapter_num), y = freq_per_chap, color = word, group = word)) +
  geom_point() +
  geom_line() +
  labs(x = "", y = "Frequency") +
  facet_wrap(~word, scales = "free_y", nrow = 3) +
  scale_color_manual(values = c("#740001", "#AE0001", "#EEBA30", "#D3A625", "#000000", "darkgreen")) +
  theme_minimal() +
  theme(legend.position = "none")
### Count the number of words in each chapter assigned to each value (from -5 to 5)
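`hp3_afinn`, used in the counting step, is not defined anywhere in this post. The outline says the "afinn" lexicon assigns each word a value, so a plausible construction is an inner join between the tokenized words and the lexicon. The sketch below illustrates that join on toy data. The `toy_*` names and the two-word lexicon are invented stand-ins with made-up scores, and it assumes one row per word occurrence, so that counting rows counts usages. In the real analysis the lexicon would presumably come from `get_sentiments("afinn")`, which asks for confirmation before its first download and so is left out of this self-contained example.

```r
library(dplyr)

# Toy stand-in for the tokenized, stop-word-free text: one row per word occurrence
toy_words_clean <- tibble(
  chapter_num = c(1, 1, 1, 2, 2),
  word = c("happy", "owl", "terrible", "happy", "happy")
)

# Hand-made stand-in lexicon with the same columns as AFINN (word, value);
# these scores are illustrative and are not the real AFINN values
toy_lexicon <- tibble(
  word = c("happy", "terrible"),
  value = c(3, -3)
)

# inner_join keeps only words that have a score, so unscored words like "owl" drop out
toy_afinn <- toy_words_clean %>%
  inner_join(toy_lexicon, by = "word")
```

Because unscored words are dropped by the join, the chapter averages reflect only the sentiment-bearing vocabulary, not every word in the chapter.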
afinn_counts <- hp3_afinn %>%
  group_by(chapter_num, value) %>%
  summarize(n = n())
### Take a weighted average of the values (using the number of words to weight)
afinn_mean <- afinn_counts %>%
  summarize(weighted_avg_value = weighted.mean(value, n))
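To make the weighting concrete, here is a small worked example on invented counts (the `toy_*` names and numbers are illustrative only). In the original code, `afinn_counts` is still grouped by `chapter_num` after the first `summarize()`, because dplyr drops only the last grouping level by default, which is why the weighted average is computed per chapter without another `group_by()`. The sketch groups explicitly so that it stands alone.

```r
library(dplyr)

# Toy counts: how many words in each chapter received each sentiment value
toy_counts <- tibble(
  chapter_num = c(1, 1, 1, 2, 2),
  value = c(-2, 1, 3, -3, 2),
  n = c(4, 3, 1, 1, 3)
)

# Weighted mean per chapter: sum(value * n) / sum(n)
toy_mean <- toy_counts %>%
  group_by(chapter_num) %>%
  summarize(weighted_avg_value = weighted.mean(value, n))

# Chapter 1: (-2*4 + 1*3 + 3*1) / (4 + 3 + 1) = -0.25
# Chapter 2: (-3*1 + 2*3) / (1 + 3) = 0.75
```

A chapter with a few strongly positive words can still average negative if mildly negative words are much more frequent, as in the toy Chapter 1.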
### Plot
ggplot(data = afinn_mean) +
  geom_col(aes(x = as.factor(chapter_num), y = weighted_avg_value, fill = weighted_avg_value > 0)) +
  labs(x = "Chapter", y = "Average Word Positivity") +
  scale_fill_manual(name = 'Positive?', values = setNames(c('#D3A625','#AE0001'), c(TRUE, FALSE))) +
  theme_minimal() +
  theme(legend.position = "none")
The first portion of the book has some overall positive chapters (like Chapter 4, where Harry returns to school and reunites with his friends) and some more negative chapters (like Chapters 2 and 3, where Harry accidentally inflates his Aunt Marge like a balloon, and then must run away and catch the chaotic Knight Bus). The latter half of the book takes on a much heavier and more negative tone, with the most negative chapter being Chapter 17, where Harry, Ron, and Hermione find themselves caught in the Shrieking Shack amidst a showdown between Sirius Black, Professor Lupin, Professor Snape, and Peter Pettigrew (as the rat Scabbers). The book ultimately ends on a positive note in the final chapter, once Sirius and Buckbeak have safely escaped from their respective death sentences!